Learning Check

When you complete this session, you will be able to:

  1. understand the assumptions in SLR;
  2. understand how to critically assessing a regression model;
  3. perform regression diagnostics checking for Simple Linear Regression;
  4. understand outliers and leverage;
  5. perform some simple transformations.

Before you start, load the stored datasets provided for exercises below by running the following chunk (assuming the file “InsulatingData.Rdata” is saved in the same directory as the .Rmd worksheet).

Question 1 The timing of production runs (Sheather, 2009)

The original data are in the form of the time taken (in minutes) for a production run, \(Y\), and the number of items produced, \(X\), for 20 randomly selected orders as supervised by three managers. At this stage we shall only consider the data for one of the managers.

(a). Open the Excel file production.txt. Construct a scatterplot of production run, \(Y\), and the number of items produced, \(X\). Fit the least squares line to the data using R. Explain the fitted model, by interpreting the slope and intercept.

production=read.table("production.txt", header=T)
production
prod.lm=lm(RunTime ~ RunSize, data=production) ## the LS line
prod.lm1=lm(RunTime ~ 10*RunSize, data=production)
Error in terms.formula(formula, data = data) : 
  invalid model formula in ExtractVars

The equation of the least squares line of best fit is

\[y = 149.7 + 0.26x.\]

The intercept is 149.7, which is where the line of best fit crosses the run time axis. The intercept in the model has the following interpretation: for any production run, the average set up time is 149.7 minutes.

The slope of the line is 0.26. Thus, we say that each additional unit to be produced is predicted to add 0.26 minutes to the run time.

The value of \(R^2= 0.7302\) which implies that 73.02% of the information (variation) in Run Time is explained by the least squares regression line, suggesting that the model is a good model.

(b). Construct residual diagnostic plots, and assess whether you think that the assumptions for linear regression have been satisfied. You can use the function rstudent (or rstandard) to first standardize the residuals.

First, we manually check the normality through histograms and QQ-plots of residuals and standardised residuals.

names(prod.lm)  ## checking the output of the fitted model, residuals
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"           
 [8] "df.residual"   "xlevels"       "call"          "terms"         "model"        
res=prod.lm$residuals
std.res=rstandard(prod.lm)  ## standardised residuals
par(mfrow=c(2,2))  ## plotting 4 plots to check normality
hist(res)
hist(std.res)
qqnorm(res)
qqline(res)
qqnorm(std.res)
qqline(std.res)
par(mfrow=c(1,1))

From the above histograms and QQ-plots, the normality is satisfied.

par(mfrow=c(2,2))
plot(res, xlab="Time", ylab="Residuals") ## Residuals vs Time
plot(std.res,xlab="Time", ylab="Standardised Residuals")
plot(production$RunSize,res, xlab="Run Size", ylab="Residuals") # Residuals vs x
plot(production$RunSize,std.res,  xlab="Run Size", ylab="Standardised Residuals")

 - The next residual analysis is to check about randomness (independence) and constant variance. We plot both
 residuals and standardised residuals against time (top left and top right). From these plots we can confirm the
 the residual are random scattered with no pattern whatsoever.

 - The bottom left plot shows the residuals against Run Size to check randomness (no pattern) and constant
 variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no
 pattern) and constant variance. Both plots confirm randomness and constant variance.

(c). Using the fitted model.

par(mfrow=c(2,2)) # to view 2x2 plots
plot(prod.lm)

plot(prod.lm1)

The figure above provides diagnostic plots produced by R when the command plot(prod.lm) is used, where prod.lm is the result of the “lm” command.

+ The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity. 

+ The bottom left provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is almost flat (i.e constant) to confirm randomness and constant variance.

+ The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line, confirming Normality.

+ The bottom right plot of standardized residuals against leverage enables one
to readily identify any ‘bad’ leverage points. 

   - Recall that the rule for simple linear regression for classifying a point as a leverage
   point is if leverage $h_{ii} > 4/ n.$ As $n=20$, the rule is $h_{ii} > 0.2.$ As leverage $h_{ii}$ in the above
   bottom right is less than 0.2, then there is no leverage points.

   - Recall that we classify points as outliers if their standardized residuals have absolute value greater than
   2. All the residuals are within (-2,2), therefore no outliers being detected.

Question 2 The invoices data (Sheather, 2009)

The manager of the purchasing department of a large company would like to develop a regression model to predict the average amount of time it takes to process a given number of invoices. Over a 30-day period, data are collected on the number of invoices processed and the total time taken (in hours). The data are available in the file invoices.txt.

Complete the following tasks.

(a). Fit the least squares line to the data using R. Explain the fitted model, by interpreting the slope and intercept.

invoices=read.table("invoices.txt", header=T)
View(invoices)
inv.lm=lm(Time ~ Invoices, data=invoices)
plot(Time ~ Invoices, data=invoices)
abline(inv.lm)

summary(inv.lm)

Call:
lm(formula = Time ~ Invoices, data = invoices)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59516 -0.27851  0.03485  0.19346  0.53083 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.6417099  0.1222707   5.248 1.41e-05 ***
Invoices    0.0112916  0.0008184  13.797 5.17e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3298 on 28 degrees of freedom
Multiple R-squared:  0.8718,    Adjusted R-squared:  0.8672 
F-statistic: 190.4 on 1 and 28 DF,  p-value: 5.175e-14

The equation of the least squares line of best fit is

\[ Time= 0.642 + 0.011 Invoices .\]

The intercept is 0.642, which is not interpretable for no invoices.

The slope of the line is 0.011. Thus, we say for that for each additional invoice, the average amount of time it takes to process increases by 0.011 hour.

(b). Construct residual diagnostic plots, and assess whether you think that the assumptions for linear regression have been satisfied. You can use the function rstudent (or rstandard) to first standardize the residuals.

First, we check the normality through histograms and QQ-plots of residuals and standardised residuals.

names(inv.lm)  ## checking the output of the fitted model, residuals
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"         
[11] "terms"         "model"        
res=inv.lm$residuals
std.res=rstandard(inv.lm)  ## standardised residuals
par(mfrow=c(2,2))  ## plotting 4 plots to checkk normality
hist(res)
hist(std.res)
qqnorm(res)
qqline(res)
qqnorm(std.res)
qqline(std.res)
par(mfrow=c(1,1))

The histograms show bimodalities. From the QQ-plots, the normality seems to be satisfied, except 1 point at the top and 1 point at the bottom respectively.

par(mfrow=c(2,2))
plot(res, xlab="Time", ylab="Residuals")
plot(std.res,xlab="Time", ylab="Standardised Residuals")
plot(res, invoices$Invoices, xlab="Invoices", ylab="Residuals")
plot(std.res, invoices$Invoices,  xlab="Invoices", ylab="Standardised Residuals")

The next residual analysis is to check about randomness (independence) and constant variance. We plot both residuals and standardised residuals against time (top left and top right). From these plots we can see a bit of increasing pattern, small residuals within 0-15 and then larger residuals within 15-30.

The bottom left plot shows the residuals against Invoices to check randomness (no pattern) and constant variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. Both plots show a bit of pattern between negative and positive Invoices.

(c). Perform diagnostic checking for the fitted model in (a) using “plot(file.lm).” Interpret the outputs.

par(mfrow=c(2,2))
plot(inv.lm)

(d). Use the function influence.measures to explore measures of leverage and Cook’s distance.

# check help function first
Inv.infl<-influence.measures(inv.lm)
Inv.infl
Influence measures of
     lm(formula = Time ~ Invoices, data = invoices) :
#head(Inv.infl$infmat)

Points increase in influence to the extent that they lie on their own, a long way from the mean of \(x\) and the bulk of the data. The column “hat” in the above is the commonest measure of this leverage and is defined by:

\[h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum(x_j -\bar{x})^2} .\]

Cook’s distance (the column “cook.d” in the above) is one measure of influence: for each point, it combines the size of its residual along with a measure of the leverage. An approximate cutoff is \(4/(n-2)\), but in practice it is important to look for gaps in values of Cook’s distance instead of just whether or not the values exceed the cutoff.

There are 2 functions which can also be used to calculate these measures directly: hatvalues and cooks.distance.

Leverage.inv<-hatvalues(inv.lm)
Cooks.Dist.inv<-cooks.distance(inv.lm)
Inv <- invoices$Invoices
# plotting the influence measures against their x value - Invoices 
n=30
plot(Leverage.inv~Inv)
abline(h=4/n, col=3)
plot(Cooks.Dist.inv~Inv)
abline(h=4/(n-2), col=2)
# plotting the influence measures against their x value - Invoices 
n=30
plot(Leverage.inv~Inv)
abline(h=4/n, col=3)

plot(Cooks.Dist.inv~Inv)
abline(h=4/(n-2), col=2)

sort(Leverage.inv)
        28         26          1         11         24         25         16         15 
0.03348526 0.03395333 0.03554890 0.03580511 0.03783426 0.04002682 0.04268500 0.04470347 
        12         29          9          7          3         17         22         21 
0.04695762 0.04875109 0.04933167 0.05065542 0.05402803 0.05548070 0.05552997 0.05775210 
         2         13          5         27          6         20          8         10 
0.06354063 0.06354063 0.06435114 0.06435114 0.06529058 0.07786621 0.08542440 0.09496328 
        30         18         14          4         23         19 
0.09620163 0.09863069 0.10127820 0.10389038 0.10917143 0.18897091 

Question 3 (From Weisberg, S. (2005). Applied Linear Regression, 3rd edition. New York: Wiley)

In this question, you will be learning the R code in more detail to provide better visualisation which may be needed for your project. Furthermore, the ANOVA is explained in more detail following the lecture (part c).

The R data file Wind.RData contains a data frame (wm1) containing wind speed data (in m/s) at two sites: a reference site (Rspd) and a candidate site (CSpd). Data were collected every 6 hours for the year 2002 except in the month of May. The objective is to be able to determine whether wind speed at the reference site, which has permanent wind speed monitoring equipment, can be used to predict the wind speed at this candidate site in the future in order to determine whether to place a small wind farm there.

sort(Cooks.Dist.inv)
          10           16           20            8            9           14 
0.0002035386 0.0005353096 0.0010595498 0.0011024438 0.0032903108 0.0033298164 
          17           28           29            4            1           18 
0.0036431399 0.0045459002 0.0059479355 0.0061181410 0.0088295724 0.0090501347 
           7            5           24           25           11            6 
0.0092145747 0.0150936105 0.0174505677 0.0201627039 0.0260821401 0.0302309933 
          21           12           22           26           19           27 
0.0308161815 0.0341277200 0.0387338505 0.0423707984 0.0502287555 0.0510541088 
           3            2           15           13           23           30 
0.0599026174 0.0770062759 0.0797769665 0.0898041145 0.1207897984 0.1525784387 
  1. Draw a scatterplot of the response CSpd against the predictor RSpd and label it appropriately.
par(pty = "s") # A graphical parameter that sets the PlotTYpe to "square"

# The syntax of the plot statement says "plot CSpd against RSpd but extract them 
# from the data frame wm1. The xlim and ylim arguments are the same - the range 
# of all the data. 
plot(CSpd ~ RSpd, data = wm1, 
     xlab = "Wind speed at reference site (m/s)", 
     ylab = "Wind speed at candidate site (m/s)",
     main = "Wind speeds at reference and candidate sites",
     xlim = range(c(wm1$CSpd, wm1$RSpd)), ylim = c(range(c(wm1$CSpd, wm1$RSpd)))) 


nrow(wm1)
[1] 1116

Fit a simple linear model to these data, and present the appropriate regression summaries. Plot the fitted line onto the plot in (a).

# Load the workshop data here
load("Wind.RData")
nrow('Wind.RData')
NULL

To plot the fitted line onto an existing plot in R or RStudio, you only need to invoke the abline command (from the console) that we’ve used already and that’s shown below. Here, however, because I want to produce a separate plot from the one above, I have to re-plot the data using the above commands and then invoke abline.

# Notice the similarity with the plot statement above
Wind.lm <- lm(CSpd ~ RSpd, data = wm1)
# Here are the contents of the object Wind.lm:
names(Wind.lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"         
[11] "terms"         "model"        
# and here's some basic output:
Wind.lm

Call:
lm(formula = CSpd ~ RSpd, data = wm1)

Coefficients:
(Intercept)         RSpd  
     3.1412       0.7557  
# A bit more comes from the summary() function:
summary(Wind.lm)

Call:
lm(formula = CSpd ~ RSpd, data = wm1)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7877 -1.5864 -0.1994  1.4403  9.1738 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.14123    0.16958   18.52   <2e-16 ***
RSpd         0.75573    0.01963   38.50   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.466 on 1114 degrees of freedom
Multiple R-squared:  0.5709,    Adjusted R-squared:  0.5705 
F-statistic:  1482 on 1 and 1114 DF,  p-value: < 2.2e-16
  1. Construct the four diagnostic plots that we discussed in the lecture. Do you think the simple linear model is a good fit? What improvements might you suggest?
par(pty = "s")
plot(CSpd ~ RSpd, data = wm1, 
     xlab = "Wind speed at reference site (m/s)", 
     ylab = "Wind speed at candidate site (m/s)",
     main = "Wind speeds at reference and candidate sites",
     xlim = range(c(CSpd, RSpd)), ylim = c(range(c(CSpd, RSpd)))) 
# abline() takes two arguments: a slope and an intercept. We can
# extract these from the object Wind.lm (which is a list) by 
# using the $ operator, e.g., Wind.lm$coefficients (try it!) and
# using that as an argument to abline():
abline(Wind.lm$coef) # or simply abline(Wind.lm)

The interpretation:

     - Plot 1 - Residuals vs Fitted: There is no clear pattern, eventhough large fitted values seem to have smaller residuals.

     - Plot 2 - Scale - Location: There is no clear pattern, eventhough large fitted values seem to have smaller residuals.

     - Plot 3 - QQ plot of standardised residuals: while most of the points are on the straight line, some points on the upper part are off the lines. Normality might not be satisfied.

     - Plot 4 - Residuals vs Leverage: As $n=1116$ then the cut-off for Cook's distance is $D_i > 4/(n-2)=0.0036.$ We can see that many of the standardised residuals with the leverage greater than 0.04. 
     

Question 4: House selling price data

A set of data on the selling price \(Y\) (in $10,000) and annual taxes \(X\) (in $1,000) for 24 houses is available available online, {housesellingprice.txt}.

  1. Assuming that a simple linear regression model is appropriate, obtain the least squares fit relating selling price to taxes paid. What is the estimate of \(\sigma^2\)?

Read data in

par(mfrow=c(2,2))  # Places 4 graphs together on the same plotting region
plot(Wind.lm)

Simple Linear regression fit

house = read.table("housesellingprice.txt",header=T)
summary(house)
     sales           taxes      
 Min.   :25.90   Min.   :3.891  
 1st Qu.:29.90   1st Qu.:5.057  
 Median :33.70   Median :5.974  
 Mean   :34.61   Mean   :6.405  
 3rd Qu.:38.15   3rd Qu.:7.873  
 Max.   :45.80   Max.   :9.142  

\(\widehat\beta_0=13.3202\), \(\widehat\beta_1=3.3244\)

house.lm<-lm(sales~taxes,data=house)
summary(house.lm)

Call:
lm(formula = sales ~ taxes, data = house)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8343 -2.3157 -0.3669  1.9787  6.3168 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.3202     2.5717   5.179 3.42e-05 ***
taxes         3.3244     0.3903   8.518 2.05e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.961 on 22 degrees of freedom
Multiple R-squared:  0.7673,    Adjusted R-squared:  0.7568 
F-statistic: 72.56 on 1 and 22 DF,  p-value: 2.051e-08

\(\widehat\sigma^2={Residual MS}=8.77\)

  1. Find the mean selling price given that the taxes paid are \(7.50\).
anova(house.lm)
Analysis of Variance Table

Response: sales
          Df Sum Sq Mean Sq F value    Pr(>F)    
taxes      1 636.16  636.16  72.556 2.051e-08 ***
Residuals 22 192.89    8.77                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

the mean selling price \(=38.25296\) that the taxes paid are \(X_0=7.50\).

  1. Construct a graph of \(Y\) versus \(X\). Then add the fitted line, the 95% confidence interval of the mean response, and the 95% prediction interval of the new actual values on the graph.
predict(house.lm,newdata=data.frame(taxes=7.5),interval="confidence",level=0.95)
       fit      lwr      upr
1 38.25296 36.71776 39.78816
  1. Perform diagnostic checking for the fitted model in (a) using “plot(file.lm).” Interpret the outputs.
with(house,plot(sales~taxes,type="p"))
with(house,lines(fitted(house.lm)~taxes))
new = data.frame(taxes=seq(3,10,.1))
CIs = predict(house.lm,new,interval="confidence")
PIs = predict(house.lm,new,interval="predict")
matpoints(new$taxes,CIs,lty=c(1,2,2),col=c("black","red","red"),type="l")
matpoints(new$taxes,PIs,lty=c(1,2,2),col=c("black","blue","blue"),type="l")

par(mfrow=c(2,2))
plot(house.lm)

# check help function first
house.lev<-hatvalues(house.lm)
house.lev
         1          2          3          4          5          6          7          8 
0.08009597 0.07494802 0.10189804 0.10097003 0.07310359 0.15145530 0.04613071 0.05281317 
         9         10         11         12         13         14         15         16 
0.04744471 0.06286388 0.04197728 0.04511789 0.07355860 0.10057696 0.04314772 0.07471120 
        17         18         19         20         21         22         23         24 
0.16214717 0.04466605 0.06413614 0.14091382 0.04346584 0.10811699 0.09396602 0.17177488 
    - Recall that the rule for simple linear regression for classifying a point as a leverage
    point is if leverage $h_{ii} > 4/ n.$ As $n=24$, the rule is $h_{ii} > 0.167.$ As leverage $h_{ii}$ in the  
    above output shows that one point (observation no 24) with $h_{ii}=0.17177 > 0.167$ , then there is one
    leverage point.

    - Recall that we classify points as outliers if their standardized residuals have
    absolute value greater than 2. There is one observation (no 15) with the standard residual is greater than  
    2. However observation number 24, the leverage point,  with standard residual less than 2 is a good
    leverage point.
    
    

Question 5 Bid data (Sheather, 2009)

A building maintenance company is planning to submit a bid on a contract to clean corporate offices scattered throughout an office complex. The costs incurred by the maintenance company are proportional to the number of cleaning crews needed for this task. Recent data are available for the number of rooms that were cleaned by varying numbers of crews. For a sample of 53 days, records were kept of the number of crews used and the number of rooms that were cleaned by those crews. The data can be found in the file cleaning.txt.

The aim is to develop a regression equation to model the relationship between the number of rooms cleaned and the number of crews.

Complete the following tasks.

(a). Fit the least squares line to the data using R. Explain the fitted model, by interpreting the slope and intercept.

std.res.h=rstandard(house.lm)
#std.res.h
sort(std.res.h)
         14           1          23          11           8           4          22 
-1.36539093 -1.32680727 -1.23490307 -1.09307472 -1.05824009 -0.91550077 -0.79182376 
         12           7           3           2          19           5          13 
-0.77096776 -0.70102563 -0.18618750 -0.17949347 -0.17219414 -0.08436379 -0.03798314 
         10          17          16          20          24           9           6 
 0.19524305  0.19658735  0.45732239  0.70746820  0.77548911  1.10892170  1.33622425 
         18          21          15 
 1.47498387  1.50462011  2.18088723 
bid=read.table("cleaning.txt", header=T)
head(bid)

The equation of the least squares line of best fit is

\[Rooms = 1.78 + 3.70 Crews.\]

The intercept is \(1.78 \approx 2\) which is where the line of best fit crosses the Rooms axis. The intercept in the model has the following interpretation: for Crews=0, the average of number of cleaned room is 2.

The slope of the line is \(3.70 \approx 4\). Thus, we say that for each additional crew, it is predicted to add 4 rooms are being cleaned.

(b). Construct residual diagnostic plots, and assess whether you think that the assumptions for linear regression have been satisfied. You can use the function rstudent (or rstandard) to first standardize the residuals.

First, we manually check the normality through histograms and QQ-plots of residuals and standardised residuals.

bid.lm=lm(Rooms ~ Crews, data=bid) ## the LS line
plot(Rooms ~ Crews, data=bid, xlab="Number of Crews", ylab="Number of Rooms")
abline(bid.lm)

summary(bid.lm)

Call:
lm(formula = Rooms ~ Crews, data = bid)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.9990  -4.9901   0.8046   4.0010  17.0010 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.7847     2.0965   0.851    0.399    
Crews         3.7009     0.2118  17.472   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.336 on 51 degrees of freedom
Multiple R-squared:  0.8569,    Adjusted R-squared:  0.854 
F-statistic: 305.3 on 1 and 51 DF,  p-value: < 2.2e-16

From the above histograms and QQ-plots, the normality is satisfied.

res.bid=bid.lm$residuals
std.res.bid=rstandard(bid.lm)  ## standardised residuals
par(mfrow=c(2,2))  ## plotting 4 plots to check normality
hist(res.bid)
hist(std.res.bid)
qqnorm(res.bid)
qqline(res.bid)
qqnorm(std.res.bid)
qqline(std.res.bid)
par(mfrow=c(1,1))

 - The next residual analysis is to check about randomness (independence) and constant variance. We plot both
 residuals and standardised residuals against time (top left and top right). From these plots we can confirm the
 the residual are random scattered with no pattern whatsoever. However, there seems to show an increasing trend for variances.

 - The bottom left plot shows the residuals against Number of Crews to check randomness (no pattern) and constant variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. 

- It is clear that  the variability in the standardized residuals tends to increase  with the number of crews. Thus, the assumption that the variance of the errors is constant appears to be violated in this case. 

(c). Perform diagnostic checking for the fitted model in (a) using “plot(file.lm).” Interpret the outputs.

par(mfrow=c(2,2))
plot(res.bid, xlab="Time", ylab="Residuals") ## Residuals vs Time
plot(std.res.bid,xlab="Time", ylab="Standardised Residuals")
plot(bid$Crews,res.bid, xlab="Number of Crews", ylab="Residuals") # Residuals vs x
plot(bid$Crews,std.res.bid,  xlab="Number of Crews", ylab="Standardised Residuals")

   -  The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity.  The smoothing curve shows a reasonable linear relationship and no pattern.

   - The bottom left provides the squared root of standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is an increasing trend which means that constant variance may not be satisfied.

    - The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line except the two points, confirming Normality.

    -  The bottom right plot of standardized residuals against leverage enables one to readily identify any ‘bad’ leverage points (ie large leverage and large standardised residuals) 

(d). As the data on each axis are in the form of counts and are measured in the same units, the square root transformation of both the predictor variable and the response variable should be implemented. Apply these transformations and repeat (a).

par(mfrow=c(2,2))
plot(bid.lm)

(e). Perform diagnostic checking for the fitted model in (d) using “plot(file.lm).” Interpret the outputs.

bid.sq.lm=lm(sqrt(Rooms) ~ sqrt(Crews), data=bid) ## the LS line
plot(sqrt(Rooms) ~ sqrt(Crews), data=bid, xlab="Square Root of Number of Crews", ylab="Square Root of Number of Rooms")
abline(bid.sq.lm)

summary(bid.sq.lm)

Call:
lm(formula = sqrt(Rooms) ~ sqrt(Crews), data = bid)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.09825 -0.43988  0.06826  0.42726  1.20275 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.2001     0.2757   0.726    0.471    
sqrt(Crews)   1.9016     0.0936  20.316   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.594 on 51 degrees of freedom
Multiple R-squared:   0.89, Adjusted R-squared:  0.8879 
F-statistic: 412.7 on 1 and 51 DF,  p-value: < 2.2e-16
     - After the transformation, the bottom left-hand plot further demonstrates the 
     benefit of the square root transformation in terms of stabilizing the error term.  
     
     - Thus, taking the square root of both the  x  and the  y  variables has stabilized
     the variance of the random errors and hence produced a valid model.  

(f). Predict the number of rooms that can be cleaned by 4 crews and by 16 crews and its 95% confidence intervals using the models in (a) and (d) respectively.

Using (a)

par(mfrow=c(2,2))
plot(bid.sq.lm)

Using (d)

conf.pred0=predict(bid.lm,newdata=data.frame(Crews=c(4,16)),interval="prediction",level=0.95)
conf.pred0
       fit      lwr      upr
1 16.58827  1.58941 31.58713
2 60.99899 45.81025 76.18773
   - We can see that the prediction interval based on the transformed data is narrower 
   than that  based on the untransformed data when the number of crews is 4 (7.78, 
   27.21) vs (1.58, 31.59) and wider when the number of crews is 16 [(43.33, 81.55) is 
   wider than (45.81, 76.19)]. 

   - This is expected as the original scale the data have variance which increases as 
   the  x -variable increases which means that realistic prediction intervals will
   get wider as the  x -variable increases.  

  - In summary, ignoring nonconstant variance in the raw data from this example led to 
  invalid prediction intervals. 
  
  
---
title: "STAT2401 Analysis of Experiments"
author: '**Practical Week 6: Solutions**'
#date: "Practical 2"
output:
  html_document:
    highlight: haddock
    # number_sections: yes
    theme: flatly
    toc: yes
  html_notebook:
    highlight: haddock
    # number_sections: yes
    theme: flatly
  pdf_document:
    toc: yes
  word_document:
    highlight: tango
    toc: yes
---


```{r echo=FALSE}
knitr::opts_chunk$set(prompt=FALSE, comment=NA, tidy=TRUE, error=TRUE, warning=FALSE, message=FALSE)
```


## Learning Check

When you complete this session, you will be able to:

1. understand the assumptions in SLR;
2. understand how to critically assessing a regression model;
3. perform regression diagnostics checking for Simple Linear Regression;
4. understand outliers and leverage;
5. perform some simple transformations.


Before you start, load the stored datasets provided for exercises below by running the following chunk (assuming the file  "InsulatingData.Rdata" is saved in the same directory as the .Rmd  worksheet).

## Question 1 The timing of production runs (Sheather, 2009)

The original data are in the form of the time taken (in minutes) for a production run, $Y$, and the number of items produced,
$X$, for 20 randomly selected orders as supervised by three managers. At this stage we shall only consider the data for one of the managers.

(a). Open the Excel file *production.txt*. Construct a scatterplot of production run, $Y$, and the number of items produced, $X$. Fit the least squares line to the data using R. Explain the fitted model, by interpreting the slope and intercept.

```{r}
production=read.table("production.txt", header=T)
production
```

```{r}
prod.lm=lm(RunTime ~ RunSize, data=production) ## the LS line
prod.lm1=lm(RunTime ~ RunSize, data=production)
plot(RunTime ~ RunSize, data=production, xlab="Run Size", ylab="Run Time")
abline(prod.lm)
summary(prod.lm)
```

The equation of the least squares line of best fit is

$$y = 149.7 + 0.26x.$$

The intercept is 149.7, which is where the line of best fit crosses the run time axis. The
intercept in the model has the following interpretation: for any production run, the
average set up time is 149.7 minutes.

The slope of the line is 0.26. Thus, we say that each
additional unit to be produced is predicted to add 0.26 minutes to the run time. 

The value of $R^2=  0.7302$ which implies that 73.02% of the information (variation) in Run Time is explained by the least squares regression line, suggesting that the model is a good model.

(b).  Construct residual diagnostic plots, and assess whether you think that the assumptions for linear regression have been satisfied. You can use the function `rstudent` (or `rstandard`) to first standardize the residuals.


First, we manually check the *normality through histograms and QQ-plots of residuals and standardised residuals.*

```{r}
names(prod.lm)  ## checking the output of the fitted model, residuals
res=prod.lm$residuals
std.res=rstandard(prod.lm)  ## standardised residuals
par(mfrow=c(2,2))  ## plotting 4 plots to check normality
hist(res)
hist(std.res)
qqnorm(res)
qqline(res)
qqnorm(std.res)
qqline(std.res)
par(mfrow=c(1,1))
```

From the above histograms and QQ-plots, the normality is satisfied.

```{r}
par(mfrow=c(2,2))
plot(res, xlab="Time", ylab="Residuals") ## Residuals vs Time
plot(std.res,xlab="Time", ylab="Standardised Residuals")
plot(production$RunSize,res, xlab="Run Size", ylab="Residuals") # Residuals vs x
plot(production$RunSize,std.res,  xlab="Run Size", ylab="Standardised Residuals")
```

     - The next residual analysis is to check about randomness (independence) and constant variance. We plot both
     residuals and standardised residuals against time (top left and top right). From these plots we can confirm the
     the residual are random scattered with no pattern whatsoever.

     - The bottom left plot shows the residuals against Run Size to check randomness (no pattern) and constant
     variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no
     pattern) and constant variance. Both plots confirm randomness and constant variance.

(c). Using the fitted model.

```{r}
par(mfrow=c(2,2)) # to view 2x2 plots
plot(prod.lm)
plot(prod.lm1)
```

The figure above provides diagnostic plots produced by R when the command `plot(prod.lm)` is used, where `prod.lm` is the result of the “lm” command. 

    + The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity. 

    + The bottom left provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is almost flat (i.e constant) to confirm randomness and constant variance.

    + The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line, confirming Normality.

    + The bottom right plot of standardized residuals against leverage enables one
    to readily identify any ‘bad’ leverage points. 

       - Recall that the rule for simple linear regression for classifying a point as a leverage
       point is if leverage $h_{ii} > 4/ n.$ As $n=20$, the rule is $h_{ii} > 0.2.$ As leverage $h_{ii}$ in the above
       bottom right is less than 0.2, then there is no leverage points.

       - Recall that we classify points as outliers if their standardized residuals have absolute value greater than
       2. All the residuals are within (-2,2), therefore no outliers being detected.


## Question 2 The invoices data (Sheather, 2009)

The manager of the purchasing department of a large company would like to
develop a regression model to predict the average amount of time it takes to process a given number of invoices. Over a 30-day period, data are collected on the number of invoices processed and the total time taken (in hours). The data are available in the file *invoices.txt.* 

Complete the following tasks.

(a). Fit the least squares line to the data using R. Explain the fitted model, by interpreting the slope and intercept.

```{r}
invoices=read.table("invoices.txt", header=T)
View(invoices)
```

```{r}
inv.lm=lm(Time ~ Invoices, data=invoices)
plot(Time ~ Invoices, data=invoices)
abline(inv.lm)
summary(inv.lm)
```

The equation of the least squares line of best fit is

$$ Time= 0.642 + 0.011 Invoices .$$

The intercept is 0.642, which is not interpretable for no invoices.

The slope of the line is 0.011. Thus, we say for that for each
additional invoice, the average amount of time it takes to process increases by 0.011 hour.

(b). Construct residual diagnostic plots, and assess whether you think that the assumptions for linear regression have been satisfied. You can use the function `rstudent` (or `rstandard`) to first standardize the residuals.

First, we check the normality through histograms and QQ-plots of residuals and standardised residuals.

```{r}
names(inv.lm)  ## checking the output of the fitted model, residuals
res=inv.lm$residuals
std.res=rstandard(inv.lm)  ## standardised residuals
par(mfrow=c(2,2))  ## plotting 4 plots to checkk normality
hist(res)
hist(std.res)
qqnorm(res)
qqline(res)
qqnorm(std.res)
qqline(std.res)
par(mfrow=c(1,1))
```

The histograms show bimodalities. From the QQ-plots, the normality seems to be satisfied, except 1 point at the top and 1 point at the bottom respectively.

```{r}
par(mfrow=c(2,2))
plot(res, xlab="Time", ylab="Residuals")
plot(std.res,xlab="Time", ylab="Standardised Residuals")
plot(res, invoices$Invoices, xlab="Invoices", ylab="Residuals")
plot(std.res, invoices$Invoices,  xlab="Invoices", ylab="Standardised Residuals")
```

The next residual analysis is to check about randomness (independence) and constant variance. We plot both residuals and standardised residuals against time (top left and top right). From these plots we can see a bit of increasing pattern, small residuals within 0-15 and then larger residuals within 15-30.

The bottom left plot shows the residuals against Invoices to check randomness (no pattern) and constant variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. Both plots show a bit of pattern between negative and positive Invoices.

(c). Perform diagnostic checking for the fitted model in (a) using "plot(file.lm)." Interpret the outputs.


```{r}
par(mfrow=c(2,2))
plot(inv.lm)
```


+ The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity. 

+ The bottom left provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is a curve which means that independence and constant variance may not be satisfied.

+ The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line except the two points, confirming Normality.

+ The bottom right plot of standardized residuals against leverage enables one
to readily identify any ‘bad’ leverage points. 

       - Recall that the rule for simple linear regression for classifying a point as a leverage
       point is if leverage $h_{ii} > 4/ n.$ As $n=30$, the rule is $h_{ii} > 0.13.$ As leverage $h_{ii}$ in the above
       bottom right plot has one point with $h_{ii} > 0.13$ , then there is one leverage point.

      - Recall that we classify points as outliers if their standardized residuals have
      absolute value greater than 2. All the residuals are within (-2,2), therefore no outliers being detected. The
      above leverage point is a `good` leverage point.

(d). Use the function `influence.measures` to explore measures of leverage and Cook's distance.   

```{r}
# check help function first
Inv.infl<-influence.measures(inv.lm)
Inv.infl
#head(Inv.infl$infmat)
```

Points increase in influence to the extent that they lie on their own, a long way from the mean of $x$ and the bulk of the data. The column "hat" in the above is the commonest measure of this leverage and is defined by:

$$h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum(x_j -\bar{x})^2} .$$


Cook’s distance (the column "cook.d" in the above) is one measure of influence: for each point, it combines the size of its residual along with a measure of the leverage. An approximate cutoff is $4/(n-2)$, but in practice it is important to look for gaps in values of Cook’s distance instead of just whether or not the values exceed the cutoff.

There are 2 functions which can also be used to calculate these measures directly: `hatvalues` and `cooks.distance`.

```{r}
Leverage.inv<-hatvalues(inv.lm)
Cooks.Dist.inv<-cooks.distance(inv.lm)
Inv <- invoices$Invoices
```


```{r, fig3, fig.width = 5, fig.asp = .5}
# plotting the influence measures against their x value - Invoices 
n=30
plot(Leverage.inv~Inv)
abline(h=4/n, col=3)
plot(Cooks.Dist.inv~Inv)
abline(h=4/(n-2), col=2)
```

- Recall that we classify points as outliers if their standardized residuals have
absolute value greater than 2. All the residuals are within (-2,2), therefore no outliers being detected. The above leverage point is a `good` leverage point.

- Cooks Distance identifies one potential outlier, observation number 30. However, it is important to look for gaps in values of Cook’s distance instead of just whether or not the values exceed the cutoff.



```{r}
sort(Leverage.inv)
```

```{r}
sort(Cooks.Dist.inv)
```


## Question 3 (From Weisberg, S. (2005). *Applied Linear Regression*, 3rd edition. New York: Wiley)

In this question, you will be learning the R code in more detail to provide better visualisation which may be needed for your project. Furthermore, the ANOVA is explained in more detail following the lecture (part c).

The *R* data file `Wind.RData` contains a data frame (`wm1`) containing wind speed data (in m/s) at two sites: a reference site (`Rspd`) and a candidate site (`CSpd`). Data were collected every 6 hours for the year 2002 except in the month of May. The objective is to be able to determine whether wind speed at the reference site, which has permanent wind speed monitoring equipment, can be used to predict the wind speed at this candidate site in the future in order to determine whether to place a small wind farm there.

```{r}
# Load the workshop data here
load("Wind.RData")
nrow('Wind.RData')
```


a. Draw a scatterplot of the response `CSpd` against the predictor `RSpd` and label it appropriately. 

```{r PlotData, fig.height=7}
par(pty = "s") # A graphical parameter that sets the PlotTYpe to "square"

# The syntax of the plot statement says "plot CSpd against RSpd but extract them 
# from the data frame wm1. The xlim and ylim arguments are the same - the range 
# of all the data. 
plot(CSpd ~ RSpd, data = wm1, 
     xlab = "Wind speed at reference site (m/s)", 
     ylab = "Wind speed at candidate site (m/s)",
     main = "Wind speeds at reference and candidate sites",
     xlim = range(c(wm1$CSpd, wm1$RSpd)), ylim = c(range(c(wm1$CSpd, wm1$RSpd)))) 

nrow(wm1)
```

Fit a simple linear model to these data, and present the appropriate regression summaries. Plot the fitted line onto the plot in (a).

```{r}
# Notice the similarity with the plot statement above
Wind.lm <- lm(CSpd ~ RSpd, data = wm1)
# Here are the contents of the object Wind.lm:
names(Wind.lm)
# and here's some basic output:
Wind.lm

# A bit more comes from the summary() function:
summary(Wind.lm)
```

  *To plot the fitted line onto an existing plot in R or RStudio, you only need to invoke the `abline` command (from the console) that we've used already and that's shown below. Here, however, because I want to produce a separate plot from the one above, I have to re-plot the data using the above commands and then invoke `abline`.*
  
```{r fig.height=7}
par(pty = "s")
plot(CSpd ~ RSpd, data = wm1, 
     xlab = "Wind speed at reference site (m/s)", 
     ylab = "Wind speed at candidate site (m/s)",
     main = "Wind speeds at reference and candidate sites",
     xlim = range(c(CSpd, RSpd)), ylim = c(range(c(CSpd, RSpd)))) 
# abline() takes two arguments: a slope and an intercept. We can
# extract these from the object Wind.lm (which is a list) by 
# using the $ operator, e.g., Wind.lm$coefficients (try it!) and
# using that as an argument to abline():
abline(Wind.lm$coef) # or simply abline(Wind.lm)
```  


b. Construct the four diagnostic plots that we discussed in the lecture. Do you think the simple linear model is a good fit? What improvements might you suggest?

```{r}
par(mfrow=c(2,2))  # Places 4 graphs together on the same plotting region
plot(Wind.lm)
```

The interpretation:

         - Plot 1 - Residuals vs Fitted: There is no clear pattern, eventhough large fitted values seem to have smaller residuals.

         - Plot 2 - Scale - Location: There is no clear pattern, eventhough large fitted values seem to have smaller residuals.

         - Plot 3 - QQ plot of standardised residuals: while most of the points are on the straight line, some points on the upper part are off the lines. Normality might not be satisfied.

         - Plot 4 - Residuals vs Leverage: As $n=1116$ then the cut-off for Cook's distance is $D_i > 4/(n-2)=0.0036.$ We can see that many of the standardised residuals with the leverage greater than 0.04. 
         

## Question 4: House selling price data

A set of data on the selling price $Y$ (in \$10,000) and annual taxes $X$ (in \$1,000) for 24
houses is available available online, {housesellingprice.txt}.


a. Assuming that a simple linear regression model is appropriate,
obtain the least squares fit relating selling price to taxes paid.
What is the estimate of $\sigma^2$?


Read data in

```{r size="scriptsize"}
house = read.table("housesellingprice.txt",header=T)
summary(house)
```

Simple Linear regression fit

```{r size="scriptsize"}
house.lm<-lm(sales~taxes,data=house)
summary(house.lm)
```

$\widehat\beta_0=13.3202$, $\widehat\beta_1=3.3244$

```{r size="scriptsize"}
anova(house.lm)
```

$\widehat\sigma^2={Residual MS}=8.77$


b. Find the mean selling price given that the taxes paid are $7.50$.


```{r size="scriptsize"}
predict(house.lm,newdata=data.frame(taxes=7.5),interval="confidence",level=0.95)
```

the mean selling price $=38.25296$ that the taxes paid are $X_0=7.50$.


c. Construct a graph of $Y$ versus $X$. Then add the fitted line,
the 95\% confidence interval of the mean response, and the 95\%
prediction interval of the new actual values on the graph.


```{r size="scriptsize"}
with(house,plot(sales~taxes,type="p"))
with(house,lines(fitted(house.lm)~taxes))
new = data.frame(taxes=seq(3,10,.1))
CIs = predict(house.lm,new,interval="confidence")
PIs = predict(house.lm,new,interval="predict")
matpoints(new$taxes,CIs,lty=c(1,2,2),col=c("black","red","red"),type="l")
matpoints(new$taxes,PIs,lty=c(1,2,2),col=c("black","blue","blue"),type="l")
```


d. Perform diagnostic checking for the fitted model in (a) using “plot(file.lm).” Interpret the outputs.

```{r}
par(mfrow=c(2,2))
plot(house.lm)
```


+ The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity. 
The smoothing curve shows a reasonable linear relationship and no pattern.

+ The bottom left provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is a curve which means that independence and constant variance may not be satisfied.

+ The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line except the two points, confirming Normality.

+ The bottom right plot of standardized residuals against leverage enables one
to readily identify any ‘bad’ leverage points. 

```{r}
# check help function first
house.lev<-hatvalues(house.lm)
house.lev
```
```{r}
std.res.h=rstandard(house.lm)
#std.res.h
sort(std.res.h)
```

        - Recall that the rule for simple linear regression for classifying a point as a leverage
        point is if leverage $h_{ii} > 4/ n.$ As $n=24$, the rule is $h_{ii} > 0.167.$ As leverage $h_{ii}$ in the  
        above output shows that one point (observation no 24) with $h_{ii}=0.17177 > 0.167$ , then there is one
        leverage point.

        - Recall that we classify points as outliers if their standardized residuals have
        absolute value greater than 2. There is one observation (no 15) with the standard residual is greater than  
        2. However observation number 24, the leverage point,  with standard residual less than 2 is a good
        leverage point.
        
        
## Question 5 Bid data (Sheather, 2009)

A building maintenance company is planning to submit a bid on a contract to clean corporate
offices scattered throughout an office complex. The costs incurred by the maintenance company are
proportional to the number of cleaning crews needed for this task. Recent data are available for the number
of rooms that were cleaned by varying numbers of crews. For a sample of 53 days, records were kept of the 
number of crews used and the number of rooms that were cleaned by those crews.
The data can be found in the file *cleaning.txt*.

The aim is to develop a regression equation to model the relationship between the 
number of rooms cleaned and the number of crews.

Complete the following tasks.

(a). Fit the least squares line to the data using R. Explain the fitted model, by interpreting the slope and intercept.

```{r}
bid=read.table("cleaning.txt", header=T)
head(bid)
```

```{r}
bid.lm=lm(Rooms ~ Crews, data=bid) ## the LS line
plot(Rooms ~ Crews, data=bid, xlab="Number of Crews", ylab="Number of Rooms")
abline(bid.lm)
summary(bid.lm)
```

The equation of the least squares line of best fit is

$$Rooms = 1.78 + 3.70 Crews.$$


The intercept is $1.78 \approx 2$ which is where the line of best fit crosses the Rooms axis. The
intercept in the model has the following interpretation: for Crews=0, the
average of number of cleaned room is 2.

The slope of the line is $3.70 \approx 4$. Thus, we say that for each
additional crew, it is predicted to add 4 rooms are being cleaned. 

(b). Construct residual diagnostic plots, and assess whether you think that the assumptions for linear regression have been satisfied. You can use the function `rstudent` (or `rstandard`) to first standardize the residuals.

First, we manually check the *normality through histograms and QQ-plots of residuals and standardised residuals.*

```{r}
res.bid=bid.lm$residuals
std.res.bid=rstandard(bid.lm)  ## standardised residuals
par(mfrow=c(2,2))  ## plotting 4 plots to check normality
hist(res.bid)
hist(std.res.bid)
qqnorm(res.bid)
qqline(res.bid)
qqnorm(std.res.bid)
qqline(std.res.bid)
par(mfrow=c(1,1))
```

From the above histograms and QQ-plots, the normality is satisfied.

```{r}
par(mfrow=c(2,2))
plot(res.bid, xlab="Time", ylab="Residuals") ## Residuals vs Time
plot(std.res.bid,xlab="Time", ylab="Standardised Residuals")
plot(bid$Crews,res.bid, xlab="Number of Crews", ylab="Residuals") # Residuals vs x
plot(bid$Crews,std.res.bid,  xlab="Number of Crews", ylab="Standardised Residuals")
```

     - The next residual analysis is to check about randomness (independence) and constant variance. We plot both
     residuals and standardised residuals against time (top left and top right). From these plots we can confirm the
     the residual are random scattered with no pattern whatsoever. However, there seems to show an increasing trend for variances.

     - The bottom left plot shows the residuals against Number of Crews to check randomness (no pattern) and constant variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. 
  
    - It is clear that  the variability in the standardized residuals tends to increase  with the number of crews. Thus, the assumption that the variance of the errors is constant appears to be violated in this case. 

(c). Perform diagnostic checking for the fitted model in (a) using "plot(file.lm)." Interpret the outputs.

```{r}
par(mfrow=c(2,2))
plot(bid.lm)
```

       -  The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity.  The smoothing curve shows a reasonable linear relationship and no pattern.

       - The bottom left provides the squared root of standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is an increasing trend which means that constant variance may not be satisfied.

        - The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line except the two points, confirming Normality.

        -  The bottom right plot of standardized residuals against leverage enables one to readily identify any ‘bad’ leverage points (ie large leverage and large standardised residuals) 

(d). As the  data  on each axis are  in the form of counts and are measured in the same units, the square 
root transformation of both the predictor variable and the response variable should be implemented. Apply 
these transformations and repeat (a).


```{r}
bid.sq.lm=lm(sqrt(Rooms) ~ sqrt(Crews), data=bid) ## the LS line
plot(sqrt(Rooms) ~ sqrt(Crews), data=bid, xlab="Square Root of Number of Crews", ylab="Square Root of Number of Rooms")
abline(bid.sq.lm)
summary(bid.sq.lm)
```

(e). Perform diagnostic checking for the fitted model in (d) using "plot(file.lm)." Interpret the outputs.


```{r}
par(mfrow=c(2,2))
plot(bid.sq.lm)
```


         - After the transformation, the bottom left-hand plot further demonstrates the 
         benefit of the square root transformation in terms of stabilizing the error term.  
         
         - Thus, taking the square root of both the  x  and the  y  variables has stabilized
         the variance of the random errors and hence produced a valid model.  


(f). Predict the number of rooms that can be cleaned by 4 crews and by 16 crews and its 95% confidence intervals using the models in (a) and (d) respectively.

Using (a)

```{r}
conf.pred0=predict(bid.lm,newdata=data.frame(Crews=c(4,16)),interval="prediction",level=0.95)
conf.pred0
```

Using (d)

```{r}
conf.pred0.sq=predict(bid.sq.lm,newdata=data.frame(Crews=c(4,16)),interval="prediction",level=0.95)
conf.pred0.sq^2  ## as this was square root
```

       - We can see that the prediction interval based on the transformed data is narrower 
       than that  based on the untransformed data when the number of crews is 4 (7.78, 
       27.21) vs (1.58, 31.59) and wider when the number of crews is 16 [(43.33, 81.55) is 
       wider than (45.81, 76.19)]. 

       - This is expected as the original scale the data have variance which increases as 
       the  x -variable increases which means that realistic prediction intervals will
       get wider as the  x -variable increases.  

      - In summary, ignoring nonconstant variance in the raw data from this example led to 
      invalid prediction intervals. 
      
      